
Dynamic hardware detection#71

Merged
divyashreepathihalli merged 11 commits into keras-team:main from divyashreepathihalli:simplify-hw-names
Mar 9, 2026
Conversation

@divyashreepathihalli
Collaborator

@divyashreepathihalli divyashreepathihalli commented Mar 5, 2026

This PR enhances the hardware parsing capabilities of keras-remote, allowing users to specify generic hardware requests (e.g., gpu-16, tpu-512) and automatically provisioning the most appropriate accelerator. It removes the strict requirement for users to know exact GKE hardware topologies, improving the overall developer experience while maintaining strict backward compatibility.

  • Dynamic GPU & TPU Fallback: Added logic to dynamically query the hardware registry for generic strings (e.g., matching gpu-N and tpu-N).
  • Generation-Aware Prioritization: Overhauled the fallback search to iterate through a predefined list of preferred hardware (_PREFERRED_GPUS and _PREFERRED_TPUS). This ensures that requests like tpu-512 automatically provision the newest available generation (e.g. v4 or v5p over v2), rather than falling back to deprecated hardware based on dictionary insertion order.
  • Canonical Accelerator Aliases: Added native regex alias support for names like v5e and ghostlite, mapping them cleanly back to the canonical v5litepod representation under the hood. This ensures backend Kubernetes node pools are labeled consistently (tpu-v5litepod-xxxx) instead of creating fragmented node pool topologies.
  • Topology & Regex Hardening: Cleaned up overlapping regex matches (e.g., _MULTI_GPU_RE matching TPU strings) to improve parsing efficiency and prevent edge-case false positives. All associated unit tests in accelerators_test.py have been updated to assert proper canonical alias resolution.
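The generation-aware fallback described above can be sketched roughly as follows. The _PREFERRED_TPUS and TPUS names come from this PR; the registry contents and the resolve_generic_tpu helper are illustrative assumptions, not the merged implementation.

```python
import re

# Hypothetical registry shape: the real TPUS specs in
# keras_remote/core/accelerators.py are richer; here each entry just
# maps a generation name to its supported chip counts.
TPUS = {
    "v2": {8, 32, 128, 256, 512},
    "v3": {8, 32, 128, 256, 512},
    "v4": {8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096},
    "v5litepod": {1, 4, 8, 16, 32, 64, 128, 256},
}

# Newest generations first, so generic requests land on modern hardware
# instead of whatever dictionary insertion order happens to yield.
_PREFERRED_TPUS = ["v5litepod", "v4", "v3", "v2"]

_GENERIC_TPU_RE = re.compile(r"^tpu-(\d+)$")


def resolve_generic_tpu(request: str) -> str:
    """Map a generic 'tpu-N' request to the newest generation supporting N chips."""
    m = _GENERIC_TPU_RE.match(request)
    if not m:
        raise ValueError(f"not a generic TPU request: {request!r}")
    chips = int(m.group(1))
    for gen in _PREFERRED_TPUS:
        if chips in TPUS[gen]:
            return f"{gen}-{chips}"
    raise ValueError(f"no TPU generation supports {chips} chips")
```

With this sketch, "tpu-512" resolves to v4 (v5litepod tops out at 256 chips here), matching the prioritization behavior the PR describes.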

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the hardware parsing and provisioning capabilities within keras-remote. It allows users to specify hardware requirements more abstractly, such as "gpu-4" or "tpu-512", and intelligently selects the most suitable accelerator, prioritizing newer generations. This change streamlines the developer experience by abstracting away complex GKE hardware topologies while maintaining backward compatibility and expanding the range of supported hardware configurations.

Highlights

  • Dynamic Hardware Provisioning: Introduced logic to automatically provision GPUs and TPUs based on generic requests like "gpu-16" or "tpu-512", removing the need for exact GKE topology knowledge.
  • Generation-Aware Prioritization: Implemented a preference-based fallback mechanism for TPUs and GPUs, ensuring that newer hardware generations are prioritized when generic requests are made.
  • Canonical Accelerator Aliases: Added support for aliases like "v5e" and "ghostlite" to map to canonical TPU names, ensuring consistent backend labeling.
  • Expanded Hardware Support: Updated GPU and TPU specifications to include new hardware types (P4, P100) and expanded available counts/topologies for existing ones (L4, A100, V2, V3, V4, V5litepod).
  • Improved Parsing Robustness: Refined regex patterns for accelerator parsing to prevent false positives and enhance efficiency, along with comprehensive unit test updates.


Changelog
  • keras_remote/core/accelerators.py
    • Expanded supported GPU counts for L4, A100, and A100-80GB.
    • Added P4 and P100 GPUs to the GPUS registry.
    • Extended TPU topology options for V2, V3, and V5litepod.
    • Introduced V4 TPU specifications with various topologies.
    • Defined _TPU_ALIASES for "v5e" and "ghostlite" to map to "v5litepod".
    • Added DEFAULT_GPU, DEFAULT_TPU, _PREFERRED_GPUS, and _PREFERRED_TPUS constants.
    • Modified regex patterns (_MULTI_GPU_RE, _TPU_CHIPS_RE, _TPU_TOPO_RE) to be more flexible and accurate.
    • Implemented new parsing logic in parse_accelerator for generic "cpu-N", "gpu", "tpu", "gpu-N", and "tpu-N" requests, including generation-aware fallback.
    • Updated parse_accelerator to correctly resolve TPU aliases for bare names, chip counts, and topology strings.
  • keras_remote/core/accelerators_test.py
    • Added a test case for parsing "l4-2" GPU.
    • Updated the invalid GPU count test from "l4x8" to "l4x16" to reflect new L4 counts.
    • Added a test for parsing "cpu-8".
    • Introduced new test cases for generic "gpu" and "tpu" requests.
    • Added tests for generic "gpu-N" and "tpu-N" requests, including dynamic fallback scenarios (e.g., "gpu-16" falling back to A100, "tpu-4096" falling back to V4).
    • Added tests to verify correct resolution of "v5e" and "ghostlite" TPU aliases.
    • Included tests for unsupported generic GPU and TPU counts to ensure proper error handling.
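The alias-resolution behavior those tests exercise can be sketched as below. The _TPU_ALIASES table is from this PR; the canonicalize_tpu_name helper is a hypothetical illustration of how aliases resolve for bare names, chip counts, and topology strings alike.

```python
# Alias table introduced in this PR (keras_remote/core/accelerators.py).
_TPU_ALIASES: dict[str, str] = {
    "v5e": "v5litepod",
    "ghostlite": "v5litepod",
}


def canonicalize_tpu_name(name: str) -> str:
    """Rewrite an aliased TPU string to its canonical form.

    Works for bare names ("v5e"), chip counts ("v5e-16"), and topology
    strings ("v5e-4x4"): only the generation prefix is rewritten.
    """
    head, sep, tail = name.partition("-")
    return _TPU_ALIASES.get(head, head) + sep + tail
```

This keeps node pool labels canonical (e.g. tpu-v5litepod-16) regardless of which alias the user typed, so pool topologies stay unfragmented.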
Activity
  • No specific activity (comments, reviews, progress updates) has been recorded for this pull request yet.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant improvement to hardware parsing by allowing dynamic and generic accelerator requests, which greatly enhances the user experience. The generation-aware fallback and support for canonical aliases are excellent additions. My review focuses on a couple of opportunities to refine the implementation by reducing redundancy and code duplication in the new parsing logic, which will improve long-term maintainability. The accompanying test updates are thorough and provide good coverage for the new features.

@divyashreepathihalli
Collaborator Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces dynamic hardware detection, which is a great enhancement for user experience, aligning well with the API design guidelines of minimizing cognitive load. The changes are extensive, adding support for generic requests like gpu-N and tpu-N, introducing aliases for TPU types, and expanding the hardware registry. The accompanying tests are thorough. I found a logic issue in the implementation of generation-aware prioritization. The search order for accelerators does not correctly prioritize the newest hardware generations as intended. My review includes suggestions to correct this.

@divyashreepathihalli
Collaborator Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable enhancement for dynamic hardware detection, simplifying accelerator requests for users. The changes are well-structured and include comprehensive tests for the new functionality. My review focuses on two main points: improving the robustness of the parsing logic to prevent potential future conflicts between GPU and TPU naming, and increasing the precision of a unit test to better reflect the deterministic nature of the new hardware selection logic. Overall, this is a great improvement to the user experience.


_TPU_ALIASES: dict[str, str] = {
"v5e": "v5litepod",
"ghostlite": "v5litepod",
Member


I don't think *fish names are okay to use externally

Collaborator Author


removed

Collaborator


fyi: The fish names are exposed externally at https://cloud.google.com/skus/sku-groups/vertex-prediction.

Collaborator

@JyotinderSingh JyotinderSingh left a comment


Thanks for the PR! Left a few comments.



Comment on lines +170 to +171
_MULTI_GPU_RE = re.compile(r"^(.+?)(?:x|-)(\d+)$") # "a100x4", "l4-2"
_TPU_CHIPS_RE = re.compile(r"^([a-z0-9_]+)-(\d+)$") # "v3-8", "v5litepod-16"
Collaborator


_MULTI_GPU_RE = r"^(.+?)(?:x|-)(\d+)$" matches any name-number pattern, which is the same shape as _TPU_CHIPS_RE = r"^([a-z0-9_]+)-(\d+)$".

e.g., "l4-2" matches both regexes. The reason it works today is that the Multi-GPU check sits at the end of the function, so TPU parsing falls through first (since "l4" is not in TPUS).

If anyone reorders the checks in the future, TPU parsing will intercept GPU (with dash) strings or vice versa.

Do you think we should let it be for now, or should we use this opportunity to implement what was discussed offline to utilize tpu:.. and gpu:... prefixes for accelerator name, which also helps popularise the TPU branding.
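The overlap can be demonstrated directly. The two patterns below are quoted from the diff; the split_family helper is a hypothetical sketch of the gpu:/tpu: prefix idea discussed here, not code from this PR.

```python
import re
from typing import Optional

# The two patterns quoted above (keras_remote/core/accelerators.py).
_MULTI_GPU_RE = re.compile(r"^(.+?)(?:x|-)(\d+)$")   # "a100x4", "l4-2"
_TPU_CHIPS_RE = re.compile(r"^([a-z0-9_]+)-(\d+)$")  # "v3-8", "v5litepod-16"

# "l4-2" satisfies both patterns, so correctness currently depends on
# the order of checks inside parse_accelerator.
matches_both = bool(_MULTI_GPU_RE.match("l4-2")) and bool(
    _TPU_CHIPS_RE.match("l4-2")
)


def split_family(name: str) -> tuple[Optional[str], str]:
    """Peel an explicit 'gpu:'/'tpu:'/'cpu:' prefix off an accelerator name.

    With an explicit family prefix, the ambiguity never reaches the
    overlapping regexes, regardless of check ordering.
    """
    family, sep, rest = name.partition(":")
    if sep and family in ("gpu", "tpu", "cpu"):
        return family, rest
    return None, name
```

Unprefixed strings pass through unchanged, which is what keeps legacy names like "l4-2" or "v5litepod-16" working.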

Collaborator Author

@divyashreepathihalli divyashreepathihalli Mar 9, 2026


Great idea. I've updated the core parsing logic to fully support and encourage explicit gpu: and tpu: prefixes. However, I also retained fallback parsing so that legacy unprefixed strings (like v5litepod or l4) continue to work as before!

@JyotinderSingh
Collaborator

JyotinderSingh commented Mar 7, 2026

gcp.container.NodePoolNodeConfigGuestAcceleratorArgs(
type=gpu.gke_label,
count=1,
),

Just a note: we presently create all GPU node pools with count=1. I'll leave it to you whether you'd like to fix that as part of this change.

Presently, GpuSpec stores a single machine_type for all counts. We need to follow the TpuSpec model which maps each count to its own machine type.

We only need to map the int count to an str machine_type (instead of requiring a GpuTopology class), since only machine_type varies per count.

Something like:

from dataclasses import dataclass

@dataclass(frozen=True)
class GpuSpec:
  gke_label: str
  counts: dict[int, str]    # count -> machine_type

Then the GPUS registry will need to be updated as:

GPUS = {
  "a100": GpuSpec("nvidia-tesla-a100", {
    1: "a2-highgpu-1g",
    2: "a2-highgpu-2g",
    # ...
  }),
  "h100": GpuSpec("nvidia-h100-80gb", {
    1: "a3-highgpu-1g",
    # ...
  }),
# ...
}

Then we can update the guest_accelerators count in node pool creation:

guest_accelerators=[
  gcp.container.NodePoolNodeConfigGuestAcceleratorArgs(
    type=gpu.gke_label,
    count=gpu.count,
  ),
],

@divyashreepathihalli
Collaborator Author

Good catch! I have refactored GpuSpec to mirror TpuSpec's behavior. It now uses a counts: dict[int, str] dictionary that maps each explicit integer count directly to its optimal GKE machine type (e.g. mapping l4 count 8 to g2-standard-96). I also updated keras_remote/cli/infra/program.py so guest_accelerators dynamically injects count=gpu.count instead of hardcoding 1.
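A minimal runnable sketch of the refactored lookup, assuming the shapes described in this thread. The l4 count-8 to g2-standard-96 mapping is stated above; the other registry entries and the machine_type_for helper are illustrative assumptions, not the merged code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    """Mirrors TpuSpec: each supported GPU count maps to its own machine type."""
    gke_label: str
    counts: dict[int, str]  # count -> machine_type


# Illustrative subset; the real GPUS registry in accelerators.py is larger.
GPUS = {
    "l4": GpuSpec(
        "nvidia-l4",
        {1: "g2-standard-4", 2: "g2-standard-24", 8: "g2-standard-96"},
    ),
}


def machine_type_for(name: str, count: int) -> str:
    """Resolve the GKE machine type for a GPU name and explicit count."""
    spec = GPUS[name]
    try:
        return spec.counts[count]
    except KeyError:
        raise ValueError(f"unsupported count {count} for {name!r}") from None
```

Node pool creation can then pass count=gpu.count to NodePoolNodeConfigGuestAcceleratorArgs instead of a hardcoded 1.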

@divyashreepathihalli
Collaborator Author

FYI: The local e2e tests are passing.

@divyashreepathihalli divyashreepathihalli merged commit 8f3495a into keras-team:main Mar 9, 2026
4 checks passed
